
    A Benchmark Suite for Evaluating Parallel Programming Models : Introduction and Preliminary Results

    The transition to multi-core processors forces software developers to explicitly exploit thread-level parallelism to increase performance. The associated programmability problem has led to the introduction of a plethora of parallel programming models that aim to simplify software development by raising the abstraction level. Since industry has not settled on a single model, however, multiple significantly different approaches exist. This work presents a benchmark suite that can be used to classify and compare such parallel programming models and therefore aids in selecting the appropriate programming model for a given task. After a detailed explanation of the suite's design, preliminary results for two programming models, Pthreads and OmpSs/SMPSs, are presented and analyzed, leading to an outline of further extensions of the suite.
    EC/FP7/248647/EU/ENabling technologies for a programmable many-CORE/ENCOR

    Programming parallel embedded and consumer applications in OpenMP superscalar

    In this paper, we evaluate the performance and usability of the parallel programming model OpenMP Superscalar (OmpSs), apply it to 10 different benchmarks, and compare its performance with corresponding POSIX threads implementations.

    Using OpenMP superscalar for parallelization of embedded and consumer applications

    In recent years, research and industry have introduced several parallel programming models to simplify the development of parallel applications. A popular class among these is task-based programming models, which claim ease of use, portability, and high performance. A novel model in this class, OpenMP Superscalar (OmpSs), combines advanced features such as automated runtime dependency resolution with simple pragma-based programming for C/C++. OpenMP Superscalar has proven to be effective in leveraging parallelism in HPC workloads. Embedded and consumer applications, however, are still mainly parallelized using traditional thread-based programming models. In this work, we investigate how effective OpenMP Superscalar is for embedded and consumer applications in terms of usability and performance. To determine the usability of OmpSs, we show in detail how to implement complex parallelization strategies such as those used in parallel H.264 decoding. To evaluate the performance, we created a collection of ten embedded and consumer benchmarks parallelized in both OmpSs and Pthreads.
    EC/FP7/248647/EU/ENabling technologies for a programmable many-CORE/ENCOR

    On latency in GPU throughput microarchitectures

    Modern GPUs provide massive processing power (arithmetic throughput) as well as memory throughput. While it appears to be well understood how performance can be improved by increasing throughput, it is less clear how micro-architectural latencies affect the performance of throughput-oriented GPU architectures. In fact, little is publicly known about the values, behavior, and performance impact of microarchitecture latency components in modern GPUs. This work attempts to fill that gap by analyzing both the idle (static) and the loaded (dynamic) latency behavior of GPU microarchitectural components. Our results show that GPUs are not as effective at latency hiding as commonly thought. Based on that, we argue that latency, in addition to throughput, should be a GPU design consideration.

    Spatio-temporal SIMT and scalarization for improving GPU efficiency

    Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone results in a performance reduction, but a combination of scalarization and STSIMT yields a mean performance enhancement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.
    EC/FP7/288653/EU/Low-Power Parallel Computing on GPUs/LPGP

    How a single chip causes massive power bills : GPUSimPow: A GPGPU power simulator

    Modern GPUs are true powerhouses in every meaning of the word: while they offer general-purpose (GPGPU) compute performance an order of magnitude higher than that of conventional CPUs, they have also been rapidly approaching the infamous “power wall”, as a single chip sometimes consumes more than 300 W. Thus, the design space of GPGPU microarchitecture has been extended by another dimension: power. While GPU researchers have previously relied on cycle-accurate simulators for estimating performance during design cycles, there are no simulation tools that include power as well. To mitigate this issue, we introduce the GPUSimPow power estimation framework for GPGPUs, consisting of both analytical and empirical models for regular and irregular hardware components. To validate this framework, we build a custom measurement setup to obtain power numbers from real graphics cards. An evaluation on a set of well-known benchmarks reveals an average relative error of 11.7% between simulated and hardware power for GT240 and an average relative error of 10.8% for GTX580. The simulator has been made available to the public [1].
    EC/FP7/288653/EU/Low-Power Parallel Computing on GPUs/LPGP

    GPGPU workload characteristics and performance analysis

    GPUs are much more power-efficient devices than CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain and continue high-performance computing growth, new architectural and application techniques are required to create power-efficient computing systems. To find such techniques, however, it is necessary to study the power consumption at a detailed level and understand the bottlenecks that cause low performance. Therefore, in this paper, we study GPU power consumption at the component level and investigate the bottlenecks that cause low performance and low energy efficiency. We divide the low-performance kernels into low-occupancy and full-occupancy categories. For the low-occupancy category, we study whether increasing the occupancy helps to increase performance and energy efficiency. For the full-occupancy category, we investigate whether these kernels are limited by memory bandwidth, coalescing efficiency, or SIMD utilization.
    EC/FP7/288653/EU/Low-Power Parallel Computing on GPUs/LPGP

    CD171- and GD2-specific CAR-T cells potently target retinoblastoma cells in preclinical in vitro testing

    BACKGROUND: Chimeric antigen receptor (CAR)-based T cell therapy is in early clinical trials to target the neuroectodermal tumor, neuroblastoma. No preclinical or clinical efficacy data are available for retinoblastoma to date. Whereas unilateral intraocular retinoblastoma is cured by enucleation of the eye, infiltration of the optic nerve indicates potential diffuse scattering and tumor spread, leading to a major therapeutic challenge. CAR-T cell therapy could improve the currently limited therapeutic strategies for metastasized retinoblastoma by simultaneously killing both primary tumor and metastasizing malignant cells and by reducing chemotherapy-related late effects. METHODS: CD171 and GD2 expression was flow cytometrically analyzed in 11 retinoblastoma cell lines. CD171 expression and T cell infiltration (CD3+) were immunohistochemically assessed in retrospectively collected primary retinoblastomas. The efficacy of CAR-T cells targeting the CD171 and GD2 tumor-associated antigens was preclinically tested against three antigen-expressing retinoblastoma cell lines. CAR-T cell activation and exhaustion were assessed by cytokine release assays and flow cytometric detection of cell surface markers, and killing ability was assessed in cytotoxic assays. CAR constructs harboring different extracellular spacer lengths (short/long) and intracellular co-stimulatory domains (CD28/4-1BB) were compared to select the most potent constructs. RESULTS: All retinoblastoma cell lines investigated expressed CD171 and GD2. CD171 was expressed in 15/30 primary retinoblastomas. Retinoblastoma cell encounter strongly activated both CD171-specific and GD2-specific CAR-T cells. Targeting either CD171 or GD2 effectively killed all retinoblastoma cell lines examined. Similar activation and killing ability for either target was achieved by all CAR constructs irrespective of the length of the extracellular spacers and the co-stimulatory domain. Cell lines differentially lost tumor antigen expression upon CAR-T cell encounter, with CD171 being completely lost by all tested cell lines and GD2 further down-regulated in cell lines expressing low GD2 levels before CAR-T cell challenge. Alternating the CAR-T cell target in sequential challenges enhanced retinoblastoma cell killing. CONCLUSION: Both CD171 and GD2 are effective targets on human retinoblastoma cell lines, and CAR-T cell therapy is highly effective against retinoblastoma in vitro. Targeting of two different antigens by sequential CAR-T cell applications enhanced tumor cell killing and preempted tumor antigen loss in preclinical testing.

    Combination of GD2-directed bispecific trifunctional antibody therapy with Pd-1 immune checkpoint blockade induces anti-neuroblastoma immunity in a syngeneic mouse model

    Introduction: Despite advances in treating high-risk neuroblastoma, 50-60% of patients still suffer relapse, necessitating new treatment options. Bispecific trifunctional antibodies (trAbs) are a promising new class of immunotherapy. TrAbs are heterodimeric IgG-like molecules that bind CD3 and a tumor-associated antigen simultaneously, thereby inducing a TCR-independent anti-cancer T cell response. Moreover, via their functional Fc region they recruit and activate cells of the innate immune system such as antigen-presenting cells, potentially enhancing induction of adaptive tumor-specific immune responses. Methods: We used the SUREK trAb, which is bispecific for GD2 and murine Cd3. Tumor-blind trAb and the monoclonal ch14.18 antibody were used as controls. A co-culture model of murine dendritic cells (DCs), T cells and a neuroblastoma cell line was established to evaluate the cytotoxic effect and the T cell effector function in vitro. Expression of immune checkpoint molecules on tumor-infiltrating T cells and the induction of an anti-neuroblastoma immune response using a combination of whole cell vaccination and trAb therapy were investigated in a syngeneic immunocompetent neuroblastoma mouse model (NXS2 in A/J background). Finally, vaccinated mice were assessed for the presence of neuroblastoma-directed antibodies. We show that SUREK trAb-mediated effective killing of NXS2 cells in vitro was strictly dependent on the combined presence of DCs and T cells. Results: Using a syngeneic neuroblastoma mouse model, we showed that vaccination with irradiated tumor cells combined with SUREK trAb treatment significantly prolonged survival of tumor-challenged mice and partially prevented tumor outgrowth compared to tumor vaccination alone. Treatment led to upregulation of programmed cell death protein 1 (Pd-1) on tumor-infiltrating T cells, and combination with anti-Pd-1 checkpoint inhibition enhanced the NXS2-directed humoral immune response. Conclusion: Here, we provide the first preclinical evidence that a tumor vaccination combined with SUREK trAb therapy induces an endogenous anti-neuroblastoma immune response, reducing tumor recurrence. Furthermore, a combination with anti-Pd-1 immune checkpoint blockade might further improve this promising immunotherapeutic concept in order to prevent relapse in high-risk neuroblastoma patients.